Improving OCR Accuracy for Classical Critical Editions
Authors
Abstract
This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR text. While the most recent OCR engines can, after suitable training, handle the polytonic Greek fonts used in 19th- and 20th-century editions, postprocessing can improve accuracy further. In particular, the paper discusses a progressive multiple alignment method applied to different OCR outputs produced from the same images.
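The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes character-level Needleman-Wunsch alignment of each new OCR output against the columns built so far, followed by a per-column majority vote. The function names (`align_to_profile`, `consensus`), the scoring scheme, and the gap penalty are all hypothetical choices made for illustration.

```python
from collections import Counter

GAP = ""  # gap marker; an empty string cannot collide with a real character


def align_to_profile(profile, seq, width, gap_penalty=-1.0):
    """Align string `seq` against `profile`, a list of columns.

    Each column holds one entry per already-aligned OCR output (`width`
    entries). Returns a new profile whose columns have `width + 1` entries.
    Scoring is an illustrative assumption, not taken from the paper.
    """
    n, m = len(profile), len(seq)

    def match(col, ch):
        # Average agreement of ch with the entries already in the column.
        return sum(1.0 if x == ch else -1.0 for x in col) / len(col)

    # Fill the dynamic-programming table.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + match(profile[i - 1], seq[j - 1]),
                score[i - 1][j] + gap_penalty,
                score[i][j - 1] + gap_penalty,
            )

    # Trace back from the bottom-right corner, building merged columns.
    merged = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + match(profile[i - 1], seq[j - 1])):
            merged.append(profile[i - 1] + [seq[j - 1]])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap_penalty):
            merged.append(profile[i - 1] + [GAP])  # seq lacks this column
            i -= 1
        else:
            merged.append([GAP] * width + [seq[j - 1]])  # seq has an extra char
            j -= 1
    merged.reverse()
    return merged


def consensus(ocr_outputs):
    """Progressively align OCR outputs and take a majority vote per column."""
    first, *rest = ocr_outputs
    profile = [[ch] for ch in first]
    for count, text in enumerate(rest, start=1):
        profile = align_to_profile(profile, text, width=count)
    # Ties go to the entry seen first, i.e. the earliest output in the list.
    return "".join(Counter(col).most_common(1)[0][0] for col in profile)


if __name__ == "__main__":
    # Three hypothetical engine outputs for the same line image.
    print(consensus(["menin aeide thea", "menin aeidc thea", "mcnin aeide thea"]))
```

Because ties are resolved in favour of the earliest output, listing the most reliable engine first is a reasonable default under this sketch. A production workflow would more likely vote with per-engine confidence weights and align at line or word level for speed.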
Similar resources
National Library of Australia
This article details the work undertaken by the National Library of Australia Newspaper Digitisation Program on identifying and testing solutions to improve OCR accuracy in large scale newspaper digitisation programs. In 2007 and 2008 several different solutions were identified, applied and tested on digitised material now available in the Australian Newspapers Digitisation Program beta service...
Important New Developments in Arabographic Optical Character Recognition (OCR)
Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities—has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines (~400 pages, 87,000 words; see Table 1 for full details). The...
Improving OCR Performance in Biomedical Literature Retrieval through Preprocessing and Postprocessing
Today’s information retrieval (IR) techniques are mostly text-based. As a consequence, some types of information are beyond the reach of text-based IR systems, which fail in situations where textual information can not be easily accessed, e.g. textual information in biomedical images and figures. To tackle such situations, we propose to augment IR systems with the ability to perform optical cha...
Greek and Latin corpora with variants and conjectures: Mapping critical apparatuses onto reference text
The principal corpora currently available in classical literature, while quite thorough, are based on authoritative editions without critical apparatuses. However, philologists need to deal with textual variants attested by manuscripts and conjectures suggested by scholars through the centuries. This paper will explore some methods for information extraction applied to digitised apparatuses of ...
How to Face the Crisis of Legitimacy: The Transfer and Further Development of Methods of Access from Printed to Digital/Digitised Editions
All media provide media specific methods of access to information and therefore media change affects also these methods of access. But the change of media and hence access methods also raises the question of legitimacy of doing this, in terms of scholarly working as well as in terms of justification in the face of the funding general public financing research either directly by the government o...